Applying Data Mining Techniques for Traffic Incident Analysis
Abstract
Computer-based simulation and visualization tools have helped to evaluate new algorithms for incident detection and strategies for incident management. Such tools enable users to evaluate solutions faster than real time. However, there is still scope to investigate situations such as traffic incidents, and to study their impacts, more thoroughly by employing techniques such as data mining. A better understanding of the impacts of an incident helps analysts to design more appropriate incident management strategies. Very little is known to date about the usefulness of data mining in traffic and transport related research, although data mining has proved highly useful in fields like business and space science. This paper explores and describes how an incident situation can be investigated using data mining, thereby demonstrating the opportunities that data mining could offer in better understanding the situation. The data used for the investigation are obtained from simulation using PARAMICS, and the tool employed for data mining is DataScope.

INTRODUCTION

Incidents in urban areas are known to contribute between 50 and 60% of the total congestion delay (FHWA, 2000). Managing an incident situation effectively is one of the key challenges that traffic management authorities face on a daily basis, and effective use of advances in communication and computing technologies has made it possible to manage incidents successfully with reduced impacts. The opportunities that computer-based simulation software now offers for analyzing incident situations have never been better. Many researchers have employed simulation to evaluate incident detection algorithms and incident management strategies (Filippo et al., 2001; Sheu et al., 2000; Sawaya et al., 2000). Some popular microscopic simulation tools are CORSIM, INTEGRATION, PARAMICS, VISSIM, AIMSUN and DRACULA.
These tools help analysts to obtain solutions quickly, even faster than real time. However, there is still scope to investigate an incident situation more thoroughly using data mining techniques. As the name indicates, data mining aims to mine the data for hidden treasure, here in the form of new knowledge that is not readily obtained with other investigative tools, including simulation. A better understanding of the impacts of an incident helps analysts to design more appropriate incident management strategies. Very little is known about the usefulness of applying data mining techniques (hereafter, data mining) in traffic and transport related research, although there are numerous applications of data mining in other fields of research, especially business and space science. Several researchers have nevertheless recognized the role that data mining can play in handling large volumes of transportation data, and the advantages of applying it to retrieve or analyze useful data (Scherer et al., 1999; Smith et al., 2001; Interstate 4 ITS, 2000; Dailey et al., 2001; Chua et al., 2001). Recent studies that adopted data mining cover performance measures of project management (Papernik et al., 2000), transportation oriented simulation systems (Arentze et al., 2000), pavement performance modeling (Attoh-Okine, 2001), demand prediction (Keuleers et al., 2001; Wet et al., 2000), travel time estimation (Choi, 2001), and logistics (Regan, 2001), while the application of data mining to incident management is still at a conceptual stage (Tamoff et al., 2001; Chan et al., 1998; ADUS, 2000).

1 Department of Civil Engineering, National University of Singapore, Singapore 117576
Journal of The Institution of Engineers, Singapore Vol. 44 Issue 2 2004 91
The first and foremost requirement for applying data mining is the data itself. This requirement is met by the large volumes of data that authorities already hold, by large data sets generated from simulation of incidents, or by both. This paper describes how an incident situation can be investigated using a data mining tool, thereby demonstrating the opportunities that data mining could offer in the field of traffic research. The data used for the investigation are obtained from simulation of an incident using PARAMICS (Quadstone Limited, 2000), and the data mining tool employed is DataScope (Cygron Private Limited, 2001). First, a short overview of data mining is given. This is followed by a brief description of the simulation environment adopted for generating data. The use of data mining is then described in detail.

AN OVERVIEW OF DATA MINING

According to Groth (2000), the meaning of the term “data mining” is open to debate. Groth defines data mining as a process of identifying hidden patterns and relationships within data. Hand (1998) describes what is new about data mining when compared to conventional statistical methodology, as follows: “Statistics... might be described as being characterized by data sets which are small and clean, which permit straightforward answers via intensive analysis of single data sets, which are static, which were sampled in an IID manner, which were often collected to answer the particular problem being addressed, and which are solely numeric. None of these apply in the data mining context”. Interestingly, data collected from sources such as loop detectors for general traffic management purposes appear well suited to analysis with tools such as data mining in the search for hidden knowledge. Such data are generally huge, dynamic, not always clean (because loops become faulty), and not solely numeric.
Data mining analysis can adopt one or more approaches such as genetic algorithms, neural networks, decision trees, clustering and others. However, a convenient way to carry out data mining analysis is to use a software program that provides facilities to mine the data in a variety of ways. DataScope is one such popular program (Mendonca, 1999; Aldana, 2000; High-Tech Heights, 2000) and is chosen here for investigating incident data. DataScope provides users with a convenient visualization of data in up to six dimensions simultaneously, giving visual insight into the data analyzed. Furthermore, the program offers a variety of data mining techniques, such as decision trees, cluster analysis, relation finder, spectrograms and others (Smith et al., 2001; Dombi, 1998). The availability of different methods is important when one is uncertain about which method may lead to an interesting and useful pattern in the data mined.

SIMULATION ENVIRONMENT

The use of simulation data for incident detection and incident analysis research has been a convenient and cost-effective approach (Abdulhai et al., 1998; Taggart, 1999). Because an incident can be created as and when required during simulation, a variety of measures of effectiveness can be used to measure the performance of traffic operation under incident conditions. For the current exercise, the PARAMICS suite of microscopic simulation tools (Abdulhai et al., 1999; Thomas, 2000; Lee, 2001) was employed to generate data. PARAMICS provides users with an Application Programming Interface (API) that can be used to customize the simulation environment and the data generation process.

Figure 1: Study area

In this study, data for analysis with data mining were generated from simulation of a real-world incident situation. The network simulated for this purpose represents approximately 50 sq.
km area in Singapore, with Ayer Rajah Expressway running across the region modeled (see Figure 1). Two major arterials run along either side of the expressway. There are 8 exits and 11 entries on the expressway stretch modeled. The expressway has three lanes in each direction of traffic, in addition to a service lane. Arterial streets have three or four lanes in each direction. There are 34 signalized intersections, all of which are demand responsive and coordinated. Loop detectors are placed along the expressway at an approximate spacing of 600 meters. Related data such as network and link details, signal operation and flow data were collected from the Land Transport Authority of Singapore. Data generated for each detector cover traffic count, flow, headway, occupancy, speed, and density. The sampling interval is 60 seconds. Simulation is carried out for a duration of 3 hours. For the current exercise, results from one simulation run are used; this is believed to be sufficient, as the interest is in the application of data mining to investigate an incident situation. With this setup, the resulting database holds approximately 13440 records and 8 fields.

DATA MINING ANALYSIS

The data generated from simulation were analyzed in DataScope using two approaches. In the first approach, a built-in algorithm known as relation finder was employed to find all strong relationships in the database. The program provides a list of pairs and triples of fields ranked by strength of relationship. The analyst can then visually examine those pairs or triples for any interesting or new patterns. In the second approach, another built-in tool known as decision support is used to arrange the data into clusters based on the similarity of the data.
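The clustering in the second approach can be illustrated with fuzzy C-means (FCM), the method this study reports using inside DataScope. The sketch below is a generic textbook FCM in plain Python, not DataScope's implementation; the farthest-point seeding, the fuzzifier m = 2, and the small pre-scaled (flow, speed, density) records are all assumptions made for illustration.

```python
import math


def init_centers(points, n_clusters):
    """Farthest-point seeding (an assumption; the paper does not describe initialisation)."""
    centers = [points[0]]
    while len(centers) < n_clusters:
        # Pick the point whose nearest existing center is farthest away.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return [list(c) for c in centers]


def memberships(points, centers, m):
    """Membership of every point in every cluster; each row sums to 1."""
    rows = []
    for p in points:
        dist = [math.dist(p, c) for c in centers]
        if min(dist) == 0.0:
            # The point coincides with a center: give it full membership there.
            hits = [1.0 if d == 0.0 else 0.0 for d in dist]
            rows.append([h / sum(hits) for h in hits])
        else:
            inv = [d ** (-2.0 / (m - 1.0)) for d in dist]
            total = sum(inv)
            rows.append([v / total for v in inv])
    return rows


def fuzzy_c_means(points, n_clusters, m=2.0, tol=1e-6, max_iter=200):
    """Generic fuzzy C-means with fuzzifier m > 1. Returns (centers, memberships)."""
    centers = init_centers(points, n_clusters)
    dim = len(points[0])
    for _ in range(max_iter):
        u = memberships(points, centers, m)
        new_centers = []
        for j in range(n_clusters):
            # Each center is the mean of all points weighted by membership**m.
            w = [row[j] ** m for row in u]
            wsum = sum(w)
            new_centers.append(
                [sum(wi * p[d] for wi, p in zip(w, points)) / wsum for d in range(dim)]
            )
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships(points, centers, m)


# Hypothetical, pre-scaled (flow, speed, density) records; not data from the paper.
records = [
    (0.80, 0.80, 0.30), (0.82, 0.78, 0.32), (0.78, 0.83, 0.28),  # normal flow
    (0.20, 0.10, 0.90), (0.22, 0.12, 0.88), (0.18, 0.08, 0.92),  # queue upstream of an incident
    (0.20, 0.90, 0.10), (0.21, 0.92, 0.08), (0.19, 0.88, 0.12),  # reduced flow downstream
]
centers, u = fuzzy_c_means(records, n_clusters=3)
# Hard label = cluster with the highest membership.
labels = [max(range(3), key=lambda j: row[j]) for row in u]
```

Unlike hard clustering, each record keeps a degree of membership in every cluster, which may be useful for records taken while traffic is moving from one condition to another. Real detector fields have very different units, so some scaling before clustering would normally be needed; the choice of scaling here is again an assumption.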
Fuzzy C-Means (FCM), a widely applied clustering method (Giles, 2001; Looney, 1997; NSCP, 2003) that is available as a function in DataScope, was used to cluster the data in this research. Again, the clusters can be viewed graphically for further understanding of the data. In the following sections, the results of using the above two approaches are presented in detail.

Investigation using 5-D Stack Bar Chart

Many sets of variables showed strong relationships among their member variables; for brevity, only the relation between headway, speed and occupancy is chosen and illustrated using a five-dimensional stacked bar chart (see Figure 2). The horizontal and vertical axes represent time and space respectively. Space is represented by a series of detectors and time by a series of intervals, each of one-minute duration. The diagram comprises stacked rectangles, where each rectangle depicts the state of a detector during a given interval. The height and color of a rectangle represent, respectively, the headway and the speed averaged over the vehicles that passed the detector during the interval. The width of the rectangles at any time interval represents system performance, measured as the delay summed over vehicles passing all detectors during that interval. This allows users to monitor both overall system performance and the effect at the location of individual detectors.

In Figure 2(a), note that the width of each stack of detectors increases at time T1 and regains its original width at time T2. The interval between T1 and T2 is the period during which overall system performance is affected by the incident. Notice that the speed at the detector denoted “DET 140” decreases abruptly at time T1. This is the location where the incident occurs. The detectors shown above DET 140 are those located downstream of the incident.
The speed reductions at different detectors (shown by color) indicate that the incident impact reaches individual detectors at different times. This represents a form of the backward forming shock wave from the incident location.

Figure 2(a): Time and distance 5D stacked bar chart

Another observation concerns the headway (height of rectangles). Note that at time T1 there is a sharp increase in headway for the detectors above DET 140. These detectors are located downstream of the incident, where the incident causes a sudden reduction in flow. For the same detectors, the headway reduces notably at time T3. This is the instant when the incident is cleared.

In Figure 2(b), by observing the speed variations (denoted by color) among detectors located upstream of the incident, two distinct triangular areas are marked (ABC and ADC). Area ABC represents the condition of queue flow and Area ADC represents the condition of discharge flow. The rest of the diagram represents the normal flow condition. Assume that there are imaginary lines spanning between A and C (ωAC), A and D (ωAD), B and C (ωBC), and C and D (ωCD). Those lines would represent four types of shock wave:
• ωAC: Backward recovery shock wave
• ωAD: Frontal stationary shock wave
• ωBC: Backward forming shock wave
• ωCD: Forward recovery shock wave

Figure 2(b): Time and distance 3D stacked bar chart

In this case, the frontal stationary shock wave occurs at detector DET 136; the area around detector DET 136 is therefore the bottleneck after the incident is cleared. The point of intersection of the frontal stationary shock wave and the forward recovery shock wave indicates the termination of congestion, hence the whole system regains its normal situation after time T2.
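The four shock wave types named in this section differ in the direction in which the boundary between two traffic states moves. A minimal sketch follows, assuming the standard kinematic-wave relation ω = (q_B − q_A) / (k_B − k_A) between two adjacent traffic states A and B; the paper itself does not give this formula, and the flow and density values below are hypothetical, not simulation output.

```python
def shock_wave_speed(q_a, k_a, q_b, k_b):
    """Speed (km/h) of the boundary between traffic states A and B.

    q_* are flows in veh/h and k_* are densities in veh/km.
    Positive values move downstream (forward), negative values move
    upstream (backward), and zero means the boundary is stationary.
    """
    if k_a == k_b:
        raise ValueError("states have equal density; no shock wave between them")
    return (q_b - q_a) / (k_b - k_a)


def direction(speed, tol=1e-9):
    """Classify a shock wave by the sign of its speed."""
    if abs(speed) <= tol:
        return "stationary"
    return "forward" if speed > 0 else "backward"


# Hypothetical traffic states as (flow veh/h, density veh/km); illustrative only.
normal = (1500, 20)
queue = (600, 100)       # congested state behind the incident
discharge = (2000, 40)   # queue releasing after the incident is cleared
released = (600, 8)      # light traffic just past a bottleneck fed by the queue

forming = shock_wave_speed(*normal, *queue)                  # normal traffic joins the queue
recovery_back = shock_wave_speed(*queue, *discharge)         # queue starts to discharge
recovery_forward = shock_wave_speed(*discharge, *normal)     # discharge gives way to normal flow
frontal = shock_wave_speed(*queue, *released)                # same flow on both sides of the bottleneck

for name, w in [("forming", forming), ("recovery (back)", recovery_back),
                ("recovery (forward)", recovery_forward), ("frontal", frontal)]:
    print(f"{name}: {w:.2f} km/h, {direction(w)}")
```

With these assumed states, the forming and first recovery waves move backward, the final recovery wave moves forward, and the frontal wave stays in place, which matches the directions implied by the names in the list above.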
Investigation using Cluster Analysis

The results of clustering by flow, speed, density, and time (hereafter, FSD∫) are given in Figures 3 to 5. The clustered data fall into three main groups: normal situation, recovery stage, and incident situation, as shown in Figure 3. FSD∫ 1 and FSD∫ 5 both belong to the normal situation and represent the general traffic condition without the influence of the incident. The recovery stage, denoted by “FSD∫ 3”, shows traffic counts increasing as the traffic regains its normal situation after the incident is removed. The incident situation includes two clusters, FSD∫ 2 and FSD∫ 4. The flows of FSD∫ 2 and FSD∫ 4 are both notably low, but their speeds and densities differ sharply. The data of FSD∫ 4, which has low flow, low speed and high density, come from the detectors located upstream of the incident. The data that come from detectors downstream of the incident are clustered as FSD∫ 2, characterized by low flow, high speed, and low density.

Figure 3: Time, count, and flow 3D scatter diagram

Figure 4: FSD∫ and count 2D discrete spectrogram

Figure 4 shows how the clusters are distributed with respect to traffic counts. All of them have a bell shape except FSD∫ 3. The shape of FSD∫ 3 approximates a rectangle, which suggests that the data of FSD∫ 3 may follow a uniform distribution. Figure 4 also shows that the speed of FSD∫ 4 is extremely low, as it is colored blue while the others are colored red. This helps to distinguish the data belonging to the detectors upstream of the incident (FSD∫ 4) from the data belonging to the detectors downstream of the incident (FSD∫ 2).

Figure 5: Detector ID and FSD∫ colored by time 2D stacked bar chart

The relations among FSD∫, density, detectors, and time are illustrated in Figure 5.
FSD∫ 4 represents higher density and lower speed. FSD∫ 2 represents lower density and higher speed. The densities of FSD∫ 3 and FSD∫ 4 at detectors DET 134 and DET 136 are alike, which indicates that the density remains high at those two detectors even though the incident has been removed. However, the densities of FSD∫ 3 at detectors DET 138 and DET 140 are lower and are similar to the density at downstream detectors. Again, the area around detectors DET 134 and DET 136 is the bottleneck after the incident is cleared. The growth of the backward forming shock wave and the backward recovery shock wave can be found by observing how the data of FSD∫ 4 and FSD∫ 3 shift over time. On the upstream side of the incident, the farthest detector reached by the backward forming shock wave is DET 122, because FSD∫ 4 appears only from detector DET 122 to detector DET 140. The farthest detector reached by the backward recovery shock wave is DET 120, because the density of FSD∫ 3 at detector DET 120 increases at the recovery stage.

CONCLUSION AND NEEDS FOR FUTURE RESEARCH

In this paper, an application of data mining in investigating an incident situation is presented. From the study, it appears that data mining provides the opportunity to better understand the impacts of an incident, and stands as a convenient tool for this purpose. The case demonstrated in this paper showed that data mining enables the user to demarcate the impact area of the incident temporally and spatially. Incident data are divided into clusters according to flow, density and time relationships. This also helps the user to visualize the distribution of data under a specific cluster or category. The information presented through visualization can assist Traffic Management Center (TMC) staff and traffic engineers to take necessary actions at the occurrence of an incident.
In addition, visualization tools help to identify any hidden relationship among traffic data. The future direction is to apply data mining to the development of a comprehensive incident management expert system that enables better and swifter incident management as well as optimum traffic diversion measures.

REFERENCES

Abdulhai, B., Sheu, J. B. and Recker, W. W. (1998). Simulation of ITS on Irvine FOT Area Using the PARAMICS 1.5 Scalable Microscopic Traffic Simulator, Phase I: Model Calibration and Validation. Technical Report, Institute of Transportation Studies, University of California, Irvine.
ADUS (2000). Archived Data User Service: An Addendum to the ITS Program Plan: Final Version 3, ITS America, http://www.itsa.org/committe.nsf/364ace963601e0e8852565d70069ea76/a30b7b6010c4a2b78525663c005b7d51!OpenDocument
Aldana, W. A. (2000). Mining Industry: Emerging Trends and New Opportunities, MIT.
High-Tech Heights: The IT Industry’s Top Innovators Bring Home the Awards, PC Almanac, 4:4 205-208, 2000, http://www.smartcomputing.com/editorial/article.asp?article=articles%2Farchive%2Fr0404%2F37r04%2F37r04%2Easp
Arentze, T. A. and Timmermans, H. J. P. (2000). ALBATROSS – A Learning-Based Transportation Oriented Simulation System, http://www.infra.kth.se/tlenet/meet5/papers/Timmermans2.pdf
Attoh-Okine, Nii O. (2001). Combining Use of Rough Set and Artificial Neural Networks in Doweled Pavement Performance Modeling – A Hybrid Approach. In Proceedings 80th Annual Meeting of TRB (CD-ROM), Washington, D.C.
Barcelo, J., E. Bernauer, L. Breheret, G. Canepari, C. D. Taranto, J. Ferrer, K. Fox, J. Gabard and R. Liu. (1999). Simulation Report. SMARTEST/D6. Institute for Transport Studies, University of Leeds.
Chan, S., Chang, E., Lin, W. H. and Skabardonis, A. (1998). Data Utilization at California Transportation Management Centers, UCB-ITS-PRR-98-34. California PATH Research Report. California PATH Program, Institute for Transportation Studies, University of California, Berkeley.
Choi, K. and Chung, Y. S. (2001). Travel Time Estimation Algorithm Using GPS Probe and Loop Detector Data Fusion. In Proceedings 80th Annual Meeting of TRB (CD-ROM), Washington, D.C.
Chua, K. M., Mckeen, G., Burge, J. and Luger, G. (2001). A Virtual Environment for Transportation Data Management System. In Proceedings 80th Annual Meeting of TRB (CD-ROM), Washington, D.C.
Dailey, D. J. and Pond, L. (2001). TDAD: An ITS Archived Data User Services (ADUS) Data Mine. In Proceedings 80th Annual Meeting of TRB (CD-ROM), Washington, D.C.
Dombi, J. (1998). Cognitive Aspects of Data Mining. DataScope: A Visualisation Tool and a Visual Query System, Cygron Research & Development Ltd., http://www.inf.u-szeged.hu/~dombi/Luxemburg%20KESDA%202.doc
FHWA (2000). Traffic Incident Management Handbook. FHWA-SA-91-056, U. S. Department of Transportation.
Filippo, L., Rindt, C. R., McNally, M. G. and Ritchie, S. G. (2001). TRICPS / CARTESIUS: An ATMS Testbed Implementation for the Evaluation of Inter-Jurisdictional Traffic Management Strategies. In Proceedings 80th Annual Meeting of TRB (CD-ROM), Washington, D.C.
Giles, D. E. A. and Draeseke, R. (2001). Econometric Modeling Based on Pattern Recognition via the Fuzzy C-Mean Clustering Algorithm, Working Paper EWP0101, Department of Economics, University of Victoria.
Groth, R. (2000). Data Mining: Building Competitive Advantage, Prentice-Hall, Inc.
Hand, D. J. (1998). Data Mining: Statistics and More, The American Statistician, 52:2 112-118.
Interstate 4 ITS (2000). Interstate 4 Intelligent Transportation System – Architecture, Reynolds, Smith and Hills, Inc., http://www.i4-its.com/Phase%20II%20Report/Sec4.pdf
Keuleers, B., Wets, G., Arentze, T. and Timmermans, H. (2001). Using Association Rules to Identify Spatial-Temporal Patterns in Multi-Day Activity Diary Data. In Proceedings 80th Annual Meeting of TRB (CD-ROM), Washington, D.C.
Lee, D-H and Yang, X. (2001). Parameter Calibration for PARAMICS Using Genetic Algorithm. In Proceedings 80th Annual Meeting of TRB (CD-ROM), Washington, D.C.
Looney, C. G. (1997). Pattern Recognition Using Neural Networks, Oxford University Press.
Luk, J. Y. K., Akcelik, R., Bowyer, D. P. and Brindle, R. E. (1983). Appraisal of Eight Small Area Traffic Management Models. Australian Road Research, 13:1 25-33.
Mendonca, M. and Sunderhaft, N. L. (1999). Mining Software Engineering Data: A Survey. DACS-SOAR-99-3. A DACS State-of-the-Art Report. DoD Data & Analysis Center for Software (DACS), Rome, NY.
NSCP (2003). Fuzzy C-means Clustering algorithm, National Scalable Cluster Project: University of Pennsylvania. http://nscp01.physics.upenn.edu/fmri/fuzzy.html
Papernik, D. K., Nanda, D., Cassada, R. O. and Morris, W. H. (2000). An Information Strategy to Enable the Analysis of Performance Measures. In Proceedings 79th Annual Meeting of TRB (CD-ROM), Washington, D.C.
Quadstone Limited (2000). PARAMICS User Guide – Version 3, Edinburgh, UK.
Cygron Private Limited (2001). DataScope User’s Manual – Version 4, Singapore.
Regan, A. C. and Song, J. (2001). An Industry in Transition: Third Party Logistics in the Information Age. In Proceedings 80th Annual Meeting of TRB (CD-ROM), Washington, D.C.
Sawaya, O. B., Doan, D. L. and Ziliaskopoulos, A. K. (2000). A Predictive Time Based Feedback Control Approach for Managing Freeway Incidents. In Proceedings 79th Annual Meeting of TRB (CD-ROM), Washington, D.C.
Scherer, W. T. and Smith, B. L. (1999). The Development of Integrated Intelligent Transportation System (IITS). In Proceedings 78th Annual Meeting of TRB (CD-ROM), Washington, D.C.
Sheu, J. B. and Ritchie, S. G. (2000). A Sequential Detection Approach for Real-Time Freeway Incident Detection and Characterization. In Proceedings 79th Annual Meeting of TRB (CD-ROM), Washington, D.C.
Smith, B. L., Scherer, W. T. and Hauser, T. A. (2001). Data Mining Tools for the Support of Traffic Signal Timing Plan Development. In Proceedings 80th Annual Meeting of TRB (CD-ROM), Washington, D.C.
Taggart, B. T. (1999). Incorporation Neural Network Traffic Prediction into Freeway Incident Detection, Morgantown WV.
Tamoff, P. and Christiansen, D. (2001). Moving Toward A National Agenda of Transportation Operations and Mobility Research: Final Draft, http://www.nationalacademies.org/trb/publications/rtforum/national_operations_agenda.pdf
Thomas, K. and Dia, H. (2000). A Neural Network Model for Arterial Incident Detection Using Probe Vehicle and Fixed Detector Data. In Proceedings of the 22nd Conference of Australian Institutes of Transport Research (CAITR 2000), Australian National University, Canberra, ACT, Australia.
Wet, G., Vanhoof, K., Arentze, T. and Timmermans, H. (2000). Identifying Decision Structures Underlying Activity Patterns: An Exploration of Data Mining Algorithms. In Proceedings 79th Annual Meeting of TRB (CD-ROM), Washington, D.C.